In [2]:
import graphlab

Load music data


In [8]:
song_data = graphlab.SFrame('song_data.gl/')


[INFO] GraphLab Server Version: 1.6.1

Explore data


In [11]:
song_data.head(2)


Out[11]:
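+----------------------------------------------+--------------------+--------------+-----------------+---------------+
| user_id                                      | song_id            | listen_count | title           | artist        |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove        | Jack Johnson  |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas | Paco De Lucia |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+
+-------------------------------------+
| song                                |
+-------------------------------------+
| The Cove - Jack Johnson             |
| Entre Dos Aguas - Paco De Lucia ... |
+-------------------------------------+
[2 rows x 6 columns]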


In [12]:
graphlab.canvas.set_target('ipynb')

In [13]:
song_data['song'].show()



In [14]:
len(song_data)


Out[14]:
1116609

Count number of users


In [52]:
users = song_data['user_id'].unique().sort()

In [53]:
len(users)


Out[53]:
66346

Create a song recommender


In [54]:
train_data, test_data = song_data.random_split(0.8, seed=0)
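The split above is random per row, so the 80/20 proportion is approximate rather than exact. The sketch below is a plain-Python illustration of that idea, assuming each row is assigned independently with probability 0.8; the helper `random_split` and the integer stand-in rows are hypothetical and are not GraphLab's implementation.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of song_data
train_rows, test_rows = random_split(rows, 0.8, seed=0)
```

Every row lands in exactly one of the two lists, and rerunning with the same seed gives the same split, which is why the notebook passes `seed=0`.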

Simple popularity-based recommender


In [55]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                          user_id='user_id',
                                                          item_id='song')


PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.54642s
PROGRESS: 893580 observations to process; with 9952 unique items.
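Conceptually, a popularity model trained without a target column can be thought of as counting observations per item and recommending the most-observed items a user has not already seen. The following is a minimal sketch under that assumption; the function names and the toy observations are hypothetical and do not reproduce GraphLab's internals.

```python
from collections import Counter

def train_popularity(observations):
    """Count how many (user, item) observations each item has."""
    return Counter(item for _user, item in observations)

def recommend_popular(item_counts, seen_items, k=10):
    """Return the k most-observed items that are not in seen_items."""
    candidates = [(item, count) for item, count in item_counts.items()
                  if item not in seen_items]
    # Highest count first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:k]

# Toy (user_id, song) observations, made up for illustration.
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
]
item_counts = train_popularity(observations)
new_user_recs = recommend_popular(item_counts, seen_items=set(), k=2)
u3_recs = recommend_popular(item_counts, seen_items={"A", "C"}, k=2)
```

Because the ranking ignores who is asking, two users who have heard none of the top songs receive the same list, which matches the identical recommendations for `users[0]` and `users[1]` in this notebook.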

Use popularity model to make predictions


In [56]:
popularity_model.recommend(users=[users[0]])


Out[56]:
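+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Undo - Björk                                        | 4227.0 | 2    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+
[10 rows x 4 columns]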


In [57]:
popularity_model.recommend(users=[users[1]])


Out[57]:
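+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Undo - Björk                                        | 4227.0 | 2    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+
[10 rows x 4 columns]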

Create a recommender with personalization


In [58]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                 user_id='user_id',
                                                                 item_id='song')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.44485s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 0.779933        |
PROGRESS: | 2000            | 0.857098        |
PROGRESS: | 3000            | 0.934396        |
PROGRESS: | 4000            | 1.01119         |
PROGRESS: | 5000            | 1.08694         |
PROGRESS: | 6000            | 1.15783         |
PROGRESS: | 7000            | 1.22701         |
PROGRESS: | 8000            | 1.30828         |
PROGRESS: | 9000            | 1.3963          |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 1.84045s
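The similarity scores produced by this model are consistent with a Jaccard measure over the sets of users who listened to each song, which is, to my understanding, the default `similarity_type` of `item_similarity_recommender`. The sketch below illustrates Jaccard item similarity in plain Python under that assumption; the helper names and toy data are hypothetical, and the real model also aggregates these similarities over a user's listening history to rank recommendations.

```python
from collections import defaultdict

def users_by_item(observations):
    """Map each item to the set of users who interacted with it."""
    item_users = defaultdict(set)
    for user, item in observations:
        item_users[item].add(user)
    return dict(item_users)

def jaccard(users_a, users_b):
    """|A intersect B| / |A union B|, or 0.0 when both sets are empty."""
    union = len(users_a | users_b)
    return float(len(users_a & users_b)) / union if union else 0.0

def similar_items(item, item_users, k=10):
    """Rank the other items by Jaccard similarity to `item`."""
    target = item_users[item]
    scores = [(other, jaccard(target, users))
              for other, users in item_users.items() if other != item]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

# Toy (user_id, song) observations, made up for illustration.
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"), ("u4", "C"),
]
item_users = users_by_item(observations)
neighbours_of_a = similar_items("A", item_users)
```

Here song B shares two of three listeners with A (score 2/3) and song C shares one of four (score 1/4), so B ranks first.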

Use personalized model to make predictions


In [59]:
personalized_model.recommend(users=[users[0]])


Out[59]:
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| user_id                                      | song                                                | score           | rank |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Furious Rose - Lisa Loeb                            | 0.0528140643589 | 1    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Utopian Dream - Dimension 5 ...                     | 0.047619047619  | 2    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | W - Perfect Stranger                                | 0.0450842038533 | 3    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Pop Champagne - Jim Jones & Ron Browz featuring ... | 0.0337295764692 | 4    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | One Last Kiss (LP Version) - Madina Lake ...        | 0.0309756983078 | 5    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | All I Need - [re:jazz]                              | 0.0213195650601 | 6    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Death Is The Road To Awe - Clint Mansell ...        | 0.020618556701  | 7    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Hunger Strike - Temple Of The Dog ...               | 0.0205128205128 | 8    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Beatbox (Nus Track) - Arma Blanca ...               | 0.0158508158508 | 9    |
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Crank That (Soulja Boy) - Soulja Boy Tell'em ...    | 0.0138282387191 | 10   |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [60]:
personalized_model.recommend(users=[users[1]])


Out[60]:
+----------------------------------------------+----------------------------------------------------+-----------------+------+
| user_id                                      | song                                               | score           | rank |
+----------------------------------------------+----------------------------------------------------+-----------------+------+
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Hand In Glove - The Smiths ...                     | 0.0676229508197 | 1    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Pretty Girls Make Graves - The Smiths ...          | 0.0511811023622 | 2    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Reel Around The Fountain - The Smiths ...          | 0.048275862069  | 3    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Asleep - The Smiths                                | 0.0472440944882 | 4    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Still Ill - The Smiths                             | 0.046082266477  | 5    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | This Charming Man - The Smiths ...                 | 0.0445544554455 | 6    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Miserable Lie - The Smiths ...                     | 0.0431818181818 | 7    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | The Hand That Rocks The Cradle (+ Sonny Boy) - ... | 0.0404761904762 | 8    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | You've Got Everything Now - The Smiths ...         | 0.0401785714286 | 9    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Chump (Album Version) - Green Day ...              | 0.03125         | 10   |
+----------------------------------------------+----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [61]:
personalized_model.get_similar_items(['With Or Without You - U2'])


PROGRESS: Getting similar items completed in 0.002401
Out[61]:
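+--------------------------+------------------------------------------------+-----------------+------+
| song                     | similar                                        | score           | rank |
+--------------------------+------------------------------------------------+-----------------+------+
| With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.0428571428571 | 1    |
| With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.033734939759  | 2    |
| With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358208955 | 3    |
| With Or Without You - U2 | Vertigo - U2                                   | 0.0300751879699 | 4    |
| With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317829457 | 5    |
| With Or Without You - U2 | Bad - U2                                       | 0.0251798561151 | 6    |
| With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154150198 | 7    |
| With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.020325203252  | 8    |
| With Or Without You - U2 | Walk On - U2                                   | 0.020202020202  | 9    |
| With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850393701 | 10   |
+--------------------------+------------------------------------------------+-----------------+------+
[10 rows x 4 columns]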


In [62]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])


PROGRESS: Getting similar items completed in 0.003128
Out[62]:
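+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118811881  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.187192118227  | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834123223  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592274678 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761316872 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.019305019305  | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191570881226 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.0187969924812 | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.0187969924812 | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.018779342723  | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]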

Quantitative comparison of models


In [63]:
%matplotlib inline
model_performance = graphlab.recommender.util.compare_models(test_data,
                                                            [popularity_model, personalized_model],
                                                            user_sample=0.05)


compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 7639.77
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 10153.3

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0320709655408 | 0.00773375426651 |
|   2    | 0.0312180143296 | 0.0163930183505  |
|   3    | 0.0282042533834 | 0.0223414644986  |
|   4    | 0.0256738314568 | 0.0268875036787  |
|   5    | 0.0235414534289 | 0.0302122656524  |
|   6    | 0.0223473217332 |  0.034638376143  |
|   7    | 0.0209094896915 | 0.0379717821888  |
|   8    | 0.0200017059024 |   0.0408634468   |
|   9    | 0.0188407445316 | 0.0436274148915  |
|   10   | 0.0182872739679 | 0.0466276075795  |
+--------+-----------------+------------------+
[10 rows x 3 columns]
[WARNING] Model trained without a target. Skipping RMSE computation.
PROGRESS: Evaluate model M1
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 948.086
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 1010.25

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.202320027294 | 0.0628340461872 |
|   2    |  0.172466734903 | 0.0992709849896 |
|   3    |  0.150915500967 |  0.127443749997 |
|   4    |  0.132889798704 |  0.147718441472 |
|   5    |  0.120095530536 |  0.16467162485  |
|   6    |  0.109860116001 |  0.180725746007 |
|   7    |  0.101525564166 |  0.193899819978 |
|   8    | 0.0939099283521 |  0.204401687798 |
|   9    | 0.0882520186512 |  0.214395696166 |
|   10   | 0.0837598089389 |  0.22591288363  |
+--------+-----------------+-----------------+
[10 rows x 3 columns]
[WARNING] Model trained without a target. Skipping RMSE computation.
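The tables above report precision and recall at each cutoff, averaged over the sampled users. At cutoff 10 the item-similarity model (M1) reaches a mean precision of about 0.084, against about 0.018 for the popularity model (M0). As a rough sketch of what these metrics measure for a single user, assuming "relevant" means the songs that user has in the held-out test data, the hypothetical helper below computes both values at a cutoff k.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of items, best first
    relevant:    set of items the user actually has in the held-out data
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["A", "B", "C", "D"]  # toy ranked recommendations
relevant = {"B", "D", "E"}          # toy held-out items
p2, r2 = precision_recall_at_k(recommended, relevant, 2)
p4, r4 = precision_recall_at_k(recommended, relevant, 4)
```

Raising the cutoff can only keep recall the same or increase it, while precision tends to fall as weaker recommendations are included, which is the pattern visible in both summary tables.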


In [64]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

pr_curves_by_model = [res['precision_recall_overall'] for res in model_performance]

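# Results follow the order passed to compare_models:
# index 0 is popularity_model (M0), index 1 is personalized_model (M1).
pr_curve = pr_curves_by_model[0].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'blue', label='M0 (popularity)')

pr_curve = pr_curves_by_model[1].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'green', label='M1 (item similarity)')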

ax.set_title('Precision-Recall Averaged Over Users')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.legend()

fig.show()


Assignment


In [66]:
user_kanye_west = song_data[song_data['artist']=='Kanye West']['user_id'].unique().sort()

In [67]:
len(user_kanye_west)


Out[67]:
2522

In [69]:
user_foo_fighters = song_data[song_data['artist']=='Foo Fighters']['user_id'].unique().sort()

In [70]:
len(user_foo_fighters)


Out[70]:
2055

In [71]:
user_taylor_swift = song_data[song_data['artist']=='Taylor Swift']['user_id'].unique().sort()

In [72]:
len(user_taylor_swift)


Out[72]:
3246

In [73]:
user_lady_gaga = song_data[song_data['artist']=='Lady GaGa']['user_id'].unique().sort()

In [74]:
len(user_lady_gaga)


Out[74]:
2928

In [89]:
artist_listen_count = song_data.groupby(key_columns='artist', operations={'listen_count': graphlab.aggregate.SUM('listen_count')})
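The `groupby` call above sums `listen_count` per artist. A plain-Python sketch of the same aggregation, using made-up rows and hypothetical names, also shows how the most and least played artists could be read off the totals instead of querying artists one at a time.

```python
from collections import defaultdict

def total_listens_by_artist(rows):
    """Sum listen counts per artist from (artist, listen_count) pairs."""
    totals = defaultdict(int)
    for artist, listen_count in rows:
        totals[artist] += listen_count
    return dict(totals)

# Toy rows, made up for illustration.
rows = [("Artist A", 3), ("Artist B", 1), ("Artist A", 2), ("Artist C", 7)]
totals = total_listens_by_artist(rows)
most_played = max(totals, key=totals.get)
least_played = min(totals, key=totals.get)
```

On the SFrame itself, sorting `artist_listen_count` by `listen_count` with `sort(..., ascending=False)`, as done later for song counts, should give the same ranking.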

In [94]:
artist_listen_count[artist_listen_count['artist']=='Taylor Swift']


Out[94]:
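+--------------+--------------+
| artist       | listen_count |
+--------------+--------------+
| Taylor Swift | 19376        |
+--------------+--------------+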
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [95]:
artist_listen_count[artist_listen_count['artist']=='Kings Of Leon']


Out[95]:
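+---------------+--------------+
| artist        | listen_count |
+---------------+--------------+
| Kings Of Leon | 43218        |
+---------------+--------------+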
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [96]:
artist_listen_count[artist_listen_count['artist']=='Coldplay']


Out[96]:
+----------+--------------+
| artist   | listen_count |
+----------+--------------+
| Coldplay | 35362        |
+----------+--------------+
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [97]:
artist_listen_count[artist_listen_count['artist']=='Lady GaGa']


Out[97]:
+-----------+--------------+
| artist    | listen_count |
+-----------+--------------+
| Lady GaGa | 12224        |
+-----------+--------------+
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [99]:
artist_listen_count[artist_listen_count['artist'] == 'William Tabbert']


Out[99]:
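+-----------------+--------------+
| artist          | listen_count |
+-----------------+--------------+
| William Tabbert | 14           |
+-----------------+--------------+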
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [100]:
artist_listen_count[artist_listen_count['artist']=='Velvet Underground & Nico']


Out[100]:
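+---------------------------+--------------+
| artist                    | listen_count |
+---------------------------+--------------+
| Velvet Underground & Nico | 80           |
+---------------------------+--------------+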
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [101]:
artist_listen_count[artist_listen_count['artist']=='Kanye West']


Out[101]:
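+------------+--------------+
| artist     | listen_count |
+------------+--------------+
| Kanye West | 9992         |
+------------+--------------+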
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [102]:
artist_listen_count[artist_listen_count['artist']=='The Cool Kids']


Out[102]:
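+---------------+--------------+
| artist        | listen_count |
+---------------+--------------+
| The Cool Kids | 73           |
+---------------+--------------+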
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [108]:
test_data_10000 = test_data['user_id'].unique().sort()[0:10000]

In [110]:
recommended_songs = personalized_model.recommend(test_data_10000, k=1)


PROGRESS: recommendations finished on 1000/10000 queries. users per second: 947.102
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1014.67
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1043.91
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1066.09
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1071.85
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1071.84
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1070.61
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1076.2
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1080.27
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1071.51

In [111]:
recommended_songs.head(2)


Out[111]:
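+----------------------------------------------+--------------------------------+-----------------+------+
| user_id                                      | song                           | score           | rank |
+----------------------------------------------+--------------------------------+-----------------+------+
| 00003a4459f33b92906be11abe0e93efc423c0ff ... | Furious Rose - Lisa Loeb       | 0.0528140643589 | 1    |
| 00005c6177188f12fb5e2e82cdbd93e8a3f35e64 ... | Hand In Glove - The Smiths ... | 0.0676229508197 | 1    |
+----------------------------------------------+--------------------------------+-----------------+------+
[2 rows x 4 columns]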


In [118]:
recommended_songs.groupby(key_columns='song', operations={'count': graphlab.aggregate.COUNT()}).sort('count', ascending=False)


Out[118]:
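+-----------------------------------------------------+-------+
| song                                                | count |
+-----------------------------------------------------+-------+
| Undo - Björk                                        | 428   |
| Secrets - OneRepublic                               | 410   |
| Revelry - Kings Of Leon                             | 224   |
| You're The One - Dwight Yoakam ...                  | 187   |
| Hey_ Soul Sister - Train                            | 139   |
| Fireflies - Charttraxx Karaoke ...                  | 112   |
| Sehr kosmisch - Harmonia                            | 88    |
| Horn Concerto No. 4 in E flat K495: II. Romance ... | 80    |
| Dog Days Are Over (Radio Edit) - Florence + The ... | 55    |
| The Scientist - Coldplay                            | 50    |
+-----------------------------------------------------+-------+
[3125 rows x 2 columns]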
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
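The last cell counts how often each song appears as a user's top-1 recommendation and sorts the counts in descending order. A minimal plain-Python equivalent, using a made-up list of top-1 songs and hypothetical variable names, might look like this:

```python
from collections import Counter

# One entry per user: that user's rank-1 recommended song (toy data).
top1_songs = ["Undo", "Secrets", "Undo", "Revelry", "Undo", "Secrets"]

song_counts = Counter(top1_songs)
# Most frequently recommended first; ties broken alphabetically.
ranked = sorted(song_counts.items(), key=lambda pair: (-pair[1], pair[0]))
most_recommended = ranked[0][0]
```

The first element of `ranked` corresponds to the first row of the sorted SFrame above.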

